## [1] "/Users/adarshnair/Desktop/iPythonNotebook/DAND_P4"
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
There are 4898 observations of 13 variables: an index column (X), 11 input features and the quality score, which gives the 12 features analysed here. Descriptions of the features can be found here: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt
I am interested in analysing the quality variable, which is scored on a scale of 0-10, with 10 being the highest quality white wine. The quality scores in this dataset range over [3,9], with a mean of 5.878 and a median of 6. I will factor the quality variable and tabulate it:
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
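As a rough check on the table above, the quality column can be rebuilt from the printed counts alone. This is only a sketch: the names `counts`, `quality` and `share_5_to_7` are mine, and the original order of the observations is not recovered.

```r
# Rebuild the quality column from the counts in the table above
# (the order of observations is lost, which does not matter here).
counts <- c(`3` = 20, `4` = 163, `5` = 1457, `6` = 2198,
            `7` = 880, `8` = 175, `9` = 5)
quality <- rep(as.integer(names(counts)), times = counts)

quality_table <- table(quality)      # same tabulation as above
n_total <- length(quality)           # number of wines
share_5_to_7 <- mean(quality >= 5 & quality <= 7)

print(quality_table)
print(round(c(mean = mean(quality), median = median(quality)), 3))
print(round(share_5_to_7, 3))
```

The rebuilt column has the same length, mean and median as reported by summary(), and about 93% of the wines fall in the 5 to 7 range.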
From this factoring we can see that most white wines have quality scores of 5, 6 or 7. To get a general idea of the data, I will generate histograms for all the features.
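As we can see, most of the distributions are roughly normal, while a few, such as residual sugar and chlorides, are skewed with a long right tail (their means exceed their medians and their maxima lie far above the third quartile). To check for outliers, it helps to visualise the same data using boxplots.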
There seem to be outliers spread throughout the feature values for white wine, but at this point it is hard to say whether they are errors in the dataset or the actual values for those white wines.
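To make the idea of an outlier concrete, here is a minimal sketch of the usual 1.5 × IQR rule that boxplots use for drawing individual points. The helper `tukey_fences` is a hypothetical name of mine, and the inputs are the rounded quartiles from the summary() output above, so the fences are approximate.

```r
# Tukey fences: points beyond 1.5 * IQR from the quartiles are drawn
# as individual outlier points by a standard boxplot.
tukey_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

# Quartiles and maxima copied from the summary() output above.
sugar_fences <- tukey_fences(q1 = 1.7, q3 = 9.9)
chloride_fences <- tukey_fences(q1 = 0.036, q3 = 0.050)

sugar_max <- 65.8
chloride_max <- 0.346

print(sugar_fences)
print(chloride_fences)
# TRUE means the maximum lies above the upper fence.
print(c(sugar = sugar_max > sugar_fences[["upper"]],
        chlorides = chloride_max > chloride_fences[["upper"]]))
```

Both maxima lie well above their upper fences, which is consistent with the outliers seen in the boxplots. The rule only flags values as unusual; it does not say whether they are errors.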
I create a new quality rating based on 3 levels, ‘Good’, ‘Average’ and ‘Mediocre’.
wv$rating <- ifelse(wv$quality > 7, 'Good', ifelse(wv$quality <= 4, 'Mediocre', 'Average'))
I then order the ratings.
wv$rating <- ordered(wv$rating, levels = c('Mediocre', 'Average', 'Good'))
Here is the summary of the ratings variable:
summary(wv$rating)
## Mediocre Average Good
## 183 4535 180
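The same three ordered levels can also be built in one step with cut(). This is an alternative sketch, not the code used above; it rebuilds the quality scores from the counts printed earlier so that it runs on its own.

```r
# Quality scores rebuilt from the counts printed earlier in this report.
quality <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# Intervals are right-closed by default: (-Inf, 4], (4, 7], (7, Inf]
rating <- cut(quality,
              breaks = c(-Inf, 4, 7, Inf),
              labels = c("Mediocre", "Average", "Good"),
              ordered_result = TRUE)

print(summary(rating))
```

The counts match the summary above (183, 4535 and 180), and the result is already an ordered factor, so the separate ordered() call is not needed with this approach.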
Visualization of the wine rating categories:
I now rescale Volatile Acidity, Citric Acid, Chlorides and Free Sulfur Dioxide by taking their log values.
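A minimal sketch of the log transformation follows, using the first few values shown in the str() output. One caveat worth checking: the summary reports a minimum of 0 for citric.acid, and log10(0) is -Inf. The shift by 1 shown below is one possible workaround and is my assumption, not necessarily what was done in this analysis.

```r
# First few values of each column, copied from the str() output above.
volatile_acidity <- c(0.27, 0.30, 0.28, 0.23)
citric_acid      <- c(0.36, 0.34, 0.40, 0.32)

log_volatile <- log10(volatile_acidity)   # fine: all values are > 0

# citric.acid has a minimum of 0 in the summary, and log10(0) is -Inf.
log_of_zero <- log10(0)

# One possible workaround (an assumption, not the original analysis):
# shift by 1 before taking the log so that 0 maps to 0.
log_citric_shifted <- log10(c(0, citric_acid) + 1)

print(round(log_volatile, 3))
print(log_of_zero)
print(round(log_citric_shifted, 3))
```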
There are 4898 observations with 12 features. The features are: fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol and quality. The attributes are described as follows:
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density: the density of wine is close to that of water depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
Output variable (based on sensory data): 12 - quality (score between 0 and 10)
I am interested in analysing the quality variable, which is scored on a scale of 0-10, with 10 being the highest quality white wine. The quality scores in this dataset range over [3,9], with a mean of 5.878 and a median of 6.
### What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
I used this link (http://winefolly.com/review/understanding-acidity-in-wine/) to understand the structure of wines and how to assess their quality. Based on that reading, the acidity (wv$pH), sweetness (wv$residual.sugar) and alcohol content (wv$alcohol) are the main driving factors in understanding and assessing the quality of wine.
To better understand the rating of the quality of the wine, I have classified the wine ratings into three categories: 0-4 is ‘Mediocre’, 5-7 is ‘Average’ and 8-9 is ‘Good’. I have used the official wine 100 point rating scale as inspiration. Based on that metric we have 183 Mediocre, 4535 Average and 180 Good category wines.
To get a better scale on the values of Volatile Acidity, Citric Acid, Chlorides and Free Sulfur Dioxide, I converted them to their log values.
To perform my bivariate analysis, I create plots to check whether there are any clearly visible relationships between the features that I expect to be related.
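I start by exploring the relationship between Quality and pH:
library(ggplot2)  # provides ggplot(), geom_point() and position_jitter()
# jitter horizontally only, so the integer quality scores do not overplot
ggplot(data = wv, aes(x = quality, y = pH)) +
  geom_point(alpha = 1/5, position = position_jitter(height = 0)) +
  xlab('Quality of wine') +
  ylab('pH of wine')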
Exploring the relationship between Quality and Sweetness:
# columns are referenced by name inside aes(), since data = wv is given
ggplot(data = wv, aes(x = quality, y = residual.sugar)) +
  geom_point(alpha = 1/5, position = position_jitter(height = 0)) +
  xlab('Quality of wine') +
  ylab('Residual Sugar')
Exploring the relationship between Quality and Alcohol:
# columns are referenced by name inside aes(), since data = wv is given
ggplot(data = wv, aes(x = quality, y = alcohol)) +
  geom_point(alpha = 1/5, position = position_jitter(height = 0)) +
  xlab('Quality of wine') +
  ylab('Alcohol content')
Exploring the relationship between Quality and Volatile Acidity:
# columns are referenced by name inside aes(), since data = wv is given
ggplot(data = wv, aes(x = quality, y = volatile.acidity)) +
  geom_point(alpha = 1/5, position = position_jitter(height = 0)) +
  xlab('Quality of wine') +
  ylab('Volatile Acidity')
Since it is hard to draw a firm conclusion from these bivariate graphs, I now explore the corresponding r scores, that is, the correlation values between Quality and the other features.
I now create 3 plots to understand these relationships better by studying how the quantiles of each feature vary with quality.
In the next 3 plots, the blue line is the 10% quantile line, the red line is the 90% quantile line and the green line is the 50% quantile line, while the black line is the mean value line.
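To show what those four lines represent, here is a small sketch that computes the 10%, 50% and 90% quantiles and the mean of a feature for each quality score. The data frame `toy` contains made-up values purely for illustration (they are not taken from the wine data), and `line_summary` is a name of mine.

```r
# Hypothetical toy values, made up purely for illustration;
# they are not taken from the wine data set.
toy <- data.frame(
  quality = rep(c(5, 6, 7), each = 5),
  alcohol = c( 9.0,  9.2,  9.5,  9.8, 10.0,
               9.5, 10.0, 10.4, 11.0, 11.6,
              10.0, 11.0, 11.4, 12.0, 12.6)
)

# One value per quality score for each summary line in the plots.
line_summary <- data.frame(
  quality = sort(unique(toy$quality)),
  q10  = as.vector(tapply(toy$alcohol, toy$quality, quantile, probs = 0.1)),
  q50  = as.vector(tapply(toy$alcohol, toy$quality, quantile, probs = 0.5)),
  q90  = as.vector(tapply(toy$alcohol, toy$quality, quantile, probs = 0.9)),
  mean = as.vector(tapply(toy$alcohol, toy$quality, mean))
)
print(line_summary)
```

Each column of `line_summary` corresponds to one line in the plots: q10 to the blue line, q50 to the green line, q90 to the red line and mean to the black line.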
Fixed acidity and Volatile acidity:
I analyse the r score:
with(wv, cor.test(as.numeric(fixed.acidity), as.numeric(volatile.acidity), method = 'pearson'))
Fixed acidity and citric acid
I analyse the r score:
with(wv, cor.test(as.numeric(fixed.acidity), as.numeric(citric.acid), method = 'pearson'))
Volatile acidity and citric acid
I analyse the r score:
with(wv, cor.test(as.numeric(volatile.acidity), as.numeric(citric.acid), method = 'pearson'))
Free sulfur dioxide and total sulfur dioxide
I then analyse the r score
with(wv, cor.test(as.numeric(free.sulfur.dioxide), as.numeric(total.sulfur.dioxide), method = 'pearson'))
Fixed acidity and pH
I analyse the r score:
with(wv, cor.test(as.numeric(fixed.acidity), as.numeric(pH), method = 'pearson'))
Volatile acidity and pH
I analyse the r score:
with(wv, cor.test(as.numeric(volatile.acidity), as.numeric(pH), method = 'pearson'))
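The pairwise tests above can also be summarised in a single correlation matrix. The sketch below uses only the first ten rows printed in the str() output, so that it runs without the data file; ten rows are far too few for meaningful estimates, and the resulting numbers should not be compared with the r scores from the full dataset. The name `first_rows` is mine.

```r
# First ten rows of four columns, copied from the str() output above.
# Ten rows are far too few for meaningful estimates; this only shows the calls.
first_rows <- data.frame(
  fixed.acidity    = c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1),
  volatile.acidity = c(0.27, 0.3, 0.28, 0.23, 0.23, 0.28, 0.32, 0.27, 0.3, 0.22),
  citric.acid      = c(0.36, 0.34, 0.4, 0.32, 0.32, 0.4, 0.16, 0.36, 0.34, 0.43),
  pH               = c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22)
)

# All pairwise Pearson correlations in one call.
r_matrix <- cor(first_rows, method = "pearson")
print(round(r_matrix, 3))

# cor.test() adds a p-value and confidence interval for a single pair.
ct <- cor.test(first_rows$fixed.acidity, first_rows$pH, method = "pearson")
r_single <- unname(ct$estimate)
print(r_single)
```

On the full data, the same cor() call applied to the numeric columns of wv would give every r score discussed here at once, while cor.test() remains the way to obtain a p-value for an individual pair.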
I now recompute the r scores on log-transformed data (taking log10 of each feature) as follows:
with(wv, cor.test(as.numeric(quality), log10(as.numeric(pH)), method = 'pearson'))
with(wv, cor.test(as.numeric(quality), log10(as.numeric(volatile.acidity)), method = 'pearson'))
with(wv, cor.test(as.numeric(quality), log10(as.numeric(residual.sugar)), method = 'pearson'))
with(wv, cor.test(as.numeric(quality), log10(as.numeric(chlorides)), method = 'pearson'))
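# citric.acid has a minimum of 0 and log10(0) is -Inf, which leaves r undefined; shifting by 1 first is one workaround
with(wv, cor.test(as.numeric(quality), log10(as.numeric(citric.acid) + 1), method = 'pearson'))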
with(wv, cor.test(as.numeric(quality), log10(as.numeric(density)), method = 'pearson'))
with(wv, cor.test(as.numeric(quality), log10(as.numeric(sulphates)), method = 'pearson'))
with(wv, cor.test(as.numeric(quality), log10(as.numeric(total.sulfur.dioxide)), method = 'pearson'))
with(wv, cor.test(as.numeric(quality), log10(as.numeric(free.sulfur.dioxide)), method = 'pearson'))
with(wv, cor.test(as.numeric(quality), log10(as.numeric(fixed.acidity)), method = 'pearson'))
with(wv, cor.test(as.numeric(quality), log10(as.numeric(alcohol)), method = 'pearson'))
The acidity (wv$pH), sweetness (wv$residual.sugar) and alcohol content (wv$alcohol) are the main driving factors in understanding and assessing the quality of wine. The correlation scores I obtained are as follows:
Features of interest: pH: 0.099, residual.sugar: -0.097, alcohol: 0.436
Other features: volatile.acidity: -0.194723, fixed.acidity: -0.1136628, chlorides: -0.2099344, citric.acid: -0.009209091, density: -0.3071233, sulphates: 0.05367788, total.sulfur.dioxide: -0.1747372, free.sulfur.dioxide: 0.008158067
Correlation is an effect size, so we can verbally describe the strength of the correlation using the guide that Evans (1996) (http://www.statstutor.ac.uk/resources/uploaded/pearsons.pdf) suggests for the absolute value of r: .00-.19 “very weak”, .20-.39 “weak”, .40-.59 “moderate”, .60-.79 “strong”, .80-1.0 “very strong”.
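The guide above can be applied mechanically with a small helper. This is a sketch: `describe_r` is a hypothetical name of mine, and treating each band as [lower, upper) is my assumption for values that fall between the quoted limits (for example between .19 and .20).

```r
# Map |r| to the verbal labels quoted above. Treating each band as
# [lower, upper) is my assumption for values between e.g. .19 and .20.
describe_r <- function(r) {
  cut(abs(r),
      breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1),
      labels = c("very weak", "weak", "moderate", "strong", "very strong"),
      right = FALSE, include.lowest = TRUE)
}

# r scores with quality, copied from the values reported above.
r_values <- c(alcohol = 0.436, density = -0.3071233,
              pH = 0.099, residual.sugar = -0.097)
labels <- as.character(describe_r(r_values))
names(labels) <- names(r_values)
print(labels)
```

With these bands, alcohol is the only feature of interest that reaches “moderate”; pH and residual sugar are both “very weak”.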
After analysing the r values, we see that alcohol, density and chlorides have the largest absolute r values. I plotted the quantile graphs for these features and they reflect the correlations shown by the r values. The graph plotting quality against alcohol shows that the alcohol content tends to be higher as the quality of wine improves, especially for wines with a quality rating above 6. Looking at the relationship between density and quality, there is a negative correlation, with the density values decreasing slightly as the wine quality increases. Lastly, I analysed the relationship between quality and chlorides and we see another negative correlation, albeit a very subtle one, with chloride content going down in higher quality wines. Although I expected pH, residual sugar and alcohol content to be the primary features for determining quality, only alcohol has a moderate r score. pH and residual sugar have very weak r scores, and the graphs of their relationships confirm this, with only minor variations in values that become prevalent only in very high quality wines.
I analysed the relationships between fixed acidity and volatile acidity (r = -0.022), between fixed acidity and citric acid (r = 0.289), between volatile acidity and citric acid (r = -0.149), and between free sulfur dioxide and total sulfur dioxide (r = 0.615). I also tested the relationship between pH and both fixed and volatile acidity, and found a correlation between pH and fixed acidity of r = -0.425.
Based on this analysis and the output of the graphs, we can see that there is a strong relationship between free sulfur dioxide and total sulfur dioxide, and a moderate negative relationship between pH and fixed acidity.
The strongest relationships to quality (r scores expressed as percentages) were:
Alcohol: 43.6%, Density: -30.7%, Chlorides (log10): -27.2%, Volatile Acidity: -19.4%, Total Sulfur Dioxide: -17.4%, Fixed Acidity: -11.3%
To dive further into this analysis, I re-evaluated the r scores using the log10 values to see if there were any stark differences after transforming the data. The only stark difference was in the correlation with chlorides, which strengthened from -20.9% to -27.2%.
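Because several of the stronger relationships are negative, it helps to rank the features by the absolute value of r. The sketch below does this with the untransformed r scores reported earlier; the vector name `r_quality` is mine.

```r
# r scores with quality, copied from the values reported above
# (all on the untransformed features).
r_quality <- c(
  pH = 0.099, residual.sugar = -0.097, alcohol = 0.436,
  volatile.acidity = -0.194723, fixed.acidity = -0.1136628,
  chlorides = -0.2099344, citric.acid = -0.009209091,
  density = -0.3071233, sulphates = 0.05367788,
  total.sulfur.dioxide = -0.1747372, free.sulfur.dioxide = 0.008158067
)

# Sort by magnitude so strong negative correlations are not overlooked.
ranked <- r_quality[order(abs(r_quality), decreasing = TRUE)]
print(round(ranked, 3))
```

The top three are alcohol, density and chlorides, in that order, which matches the list above.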
Analysis of quality with alcohol content and density
Analysis of quality with pH and residual sugar
Analysis of quality with alcohol content and residual sugar
Analysis of quality with sulphates and density
Analysis of quality with total.sulfur.dioxide and fixed.acidity
I ran an analysis to see whether combining alcohol content and density (since they have the highest correlations with the quality of wine) would reveal a clearer pattern. Looking at the results, I see that the alcohol content generally increased and the density generally decreased as the quality of the wine improved.
I ran some tests to check whether the other features I expected to matter, pH and residual sugar, could give us some insight. But even after generating those graphs and facet wrapping by the new variable ‘rating’, it was still hard to say whether these features were significant contributing factors.
After this I ran tests on some of the other features, which according to the literature do not traditionally affect wine quality and which did not have significant r values with quality. Tests comparing alcohol content and residual sugar to quality did not give much insight, apart from a subtle indication that residual sugar tends to be slightly lower in higher quality wines. I performed similar analyses with sulphates and density over quality, and with total.sulfur.dioxide and fixed.acidity over quality. Neither produced enough of a trend to show correlation.
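A plot of this kind could be sketched as follows. This is not the exact plotting code used for the figures above: the helper name `plot_by_rating` and the toy rows are hypothetical, and the plot is only built if the ggplot2 package is installed.

```r
# Build the faceted scatterplot; needs the ggplot2 package.
# `df` is assumed to have alcohol, density, quality and rating columns.
plot_by_rating <- function(df) {
  ggplot2::ggplot(df, ggplot2::aes(x = alcohol, y = density,
                                   colour = factor(quality))) +
    ggplot2::geom_point(alpha = 1/2) +
    ggplot2::facet_wrap(~ rating) +
    ggplot2::labs(x = "Alcohol content", y = "Density", colour = "Quality")
}

# Hypothetical toy rows, made up for illustration only.
toy <- data.frame(
  alcohol = c(9.0, 9.5, 10.4, 11.0, 12.2, 12.8),
  density = c(0.998, 0.997, 0.994, 0.993, 0.991, 0.990),
  quality = c(4, 4, 6, 6, 8, 8)
)
toy$rating <- cut(toy$quality, breaks = c(-Inf, 4, 7, Inf),
                  labels = c("Mediocre", "Average", "Good"),
                  ordered_result = TRUE)

# Only build the plot when ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- plot_by_rating(toy)
}
```

Calling plot_by_rating(wv) on the full data would give one panel per rating, with points coloured by quality score.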
Alcohol and Chlorides
One of the driving factors in appreciating wine quality turns out to be the alcohol content of the wine together with its chloride content, which is the amount of salt in the wine. Good wines tend to have a higher percentage of alcohol and a lower chloride content, as can be seen in this graph.
Density and Total Sulfur Dioxide
Good quality wines tend to have a low density and a low total sulfur dioxide content. Total sulfur dioxide is the amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine, which is not desirable in good wines.
Acidity in wine
Acidity plays a big role in the taste of the wine. However, most wines, irrespective of their quality, have a pH in the [3,4] range. The fixed acidity (tartaric acid) consists of the nonvolatile acids that do not evaporate; its correlation with quality is weakly negative (r = -0.114), so it tends to be slightly lower in better quality wines. The volatile acidity, which is the acetic acid that can evaporate, has lower values in good quality wines. Citric acid, which affects the freshness and flavor of wine, is found in extremely small quantities and is not a determining factor in the quality of wines. All in all, the acidity of wine plays a subtle but important role in a good quality wine!
When performing this analysis I have come to realise that understanding the underlying data itself plays a key role in extracting useful inferences. In the case of this dataset, it was key to understand the components of wine and the factors that affect its taste and quality. I spent some time reading about how wine is made and the subtle nuances that go into making different kinds of wine. There were quite a few difficulties I ran into while doing this analysis. For instance, I expected acidity to have a much greater effect on the quality of wine than it actually turned out to have. From that point on I was essentially testing all the other features to understand where correlations may lie. This is a feasible process when there is a small number of features, such as 12 in this case, and becomes much harder as the number of features goes up. On the other hand, one thing that greatly helped was the Pearson r score. This score helped narrow down my analysis and I was able to go deeper into some of the more relevant features.
Being a wine sommelier is a hard task, but analyses like these can greatly help. I am curious to perform such tests on other beverages like beer and whiskey as well.